[EP]:Add MCCL all-to-all fallback for MACA EP#2592

Open
Dayuxiaoshui wants to merge 4 commits into
kvcache-ai:mainfrom
Dayuxiaoshui:ep-support-maca

Conversation

@Dayuxiaoshui Dayuxiaoshui commented Jun 24, 2026

Collaborator

Description

This PR stabilizes the MACA Expert Parallelism (EP) runtime path on a
single-node MetaX C500 setup. The submitted path does not use the experimental
MCCL all-to-all helper. It keeps the existing EP dispatch/combine kernels and
uses the TransferEngine device P2P/IPC fast path for intra-node GPU-to-GPU
payload movement.

The main goal is to make the MACA EP path compile, initialize, dispatch, and
combine reliably while keeping the change surface limited to MACA-specific
compatibility code.

Key changes:

  • Enable MACA EP builds through the torch extension path.
  • Use the TransferEngine device P2P transport for MACA EP intra-node payloads.
  • Keep IBGDA disabled on MACA when device-side RDMA is unavailable.
  • Use a MACA RDMA transport stub so EP can fall back cleanly to P2P-only mode.
  • Add MACA device operation compatibility wrappers.
  • Avoid exposing temporary debug and phase-fence environment variables.
  • Keep the MACA SEND/RECV phase fence as an internal compatibility step.
  • Preserve the existing CUDA/NVIDIA EP behavior as much as possible.

Current runtime behavior:

  • On MACA, EP reports "IBGDA unavailable, using P2P-only path" when IBGDA cannot
    be initialized.
  • The validated payload path is still the EP runtime P2P fast path:
    fallback=False, ibgda_disabled=True, fast=True.
  • For 4-GPU full mesh P2P testing across MetaXLink pairs, the test used
    MOONCAKE_EP_MACA_ALLOW_NODE_P2P=1.

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Mooncake PG (mooncake-pg)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • Common (mooncake-common)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Performance improvement
  • Other

How Has This Been Tested?

Build and static checks

python -m py_compile mooncake-wheel/mooncake/mooncake_ep_buffer.py
git diff --check

Result: passed.

The EP extension was rebuilt in the MACA torch environment:

cd mooncake-ep
MOONCAKE_EP_USE_MACA=1 \
MACA_PATH=/opt/maca \
MACA_HOME=/opt/maca \
TORCH_CUDA_ARCH_LIST=8.0 \
/opt/miniconda3/envs/py310/bin/python setup.py build_ext --build-lib . --force

Result: passed.

EP smoke tests

2-GPU smoke:

MOONCAKE_EP_USE_MACA=1 \
python scripts/metax/smoke_ep_p2p.py \
  --world-size 2 --tokens 128 --hidden 2048 --experts 16 --topk 2

Result:

smoke ok ranks=2 tokens=128 hidden=2048 experts=16 topk=2

4-GPU smoke with full node P2P enabled:

MOONCAKE_EP_USE_MACA=1 \
MOONCAKE_EP_MACA_ALLOW_NODE_P2P=1 \
python scripts/metax/smoke_ep_p2p.py \
  --world-size 4 --tokens 64 --hidden 2048 --experts 16 --topk 2

Result:

smoke ok ranks=4 tokens=64 hidden=2048 experts=16 topk=2

EP performance checks

The following measurements were collected on June 29, 2026 on a 4-GPU
MetaX C500 MACA node. Timing is max across ranks.

2 GPUs, tokens=2048, hidden=7168, topk=2, Gloo control plane:

config: ranks=2 tokens=2048 hidden=7168 experts=16 topk=2 backend=gloo
mode=runtime_p2p fallback=False ibgda_disabled=True fast=True
dispatch: avg=10289.70 us bw=5.72 GB/s
combine: avg=7946.85 us bw=7.39 GB/s
dispatch+combine: avg=18189.30 us bw=6.46 GB/s

4 GPUs, tokens=2048, hidden=7168, topk=2,
MOONCAKE_EP_MACA_ALLOW_NODE_P2P=1, Gloo control plane:

config: ranks=4 tokens=2048 hidden=7168 experts=16 topk=2 backend=gloo
mode=runtime_p2p fallback=False ibgda_disabled=True fast=True
dispatch: avg=10434.00 us bw=5.64 GB/s
combine: avg=8093.84 us bw=7.25 GB/s
dispatch+combine: avg=18516.72 us bw=6.35 GB/s

4 GPUs, tokens=2048, hidden=7168, topk=4,
MOONCAKE_EP_MACA_ALLOW_NODE_P2P=1, Gloo control plane:

config: ranks=4 tokens=2048 hidden=7168 experts=16 topk=4 backend=gloo
mode=runtime_p2p fallback=False ibgda_disabled=True fast=True
dispatch: avg=13126.86 us bw=8.97 GB/s
combine: avg=12778.48 us bw=9.19 GB/s
dispatch+combine: avg=25871.60 us bw=9.09 GB/s
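
The bandwidth figures above can be roughly sanity-checked from the configuration. The sketch below assumes a bf16 payload (2 bytes per element) and that each token is sent to `topk` experts; the benchmark's exact byte accounting is not stated in this PR, and the helper names `payload_bytes` and `bandwidth_gb_s` are hypothetical.

```python
def payload_bytes(tokens: int, hidden: int, topk: int, bytes_per_elem: int = 2) -> int:
    """Bytes moved if every token is sent to `topk` experts (assumes bf16)."""
    return tokens * topk * hidden * bytes_per_elem


def bandwidth_gb_s(num_bytes: int, avg_us: float) -> float:
    """Convert bytes moved in `avg_us` microseconds to GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (avg_us * 1e-6) / 1e9


# Combine timings copied from the 2-GPU topk=2 and 4-GPU topk=4 runs above.
two_gpu_combine = bandwidth_gb_s(payload_bytes(2048, 7168, 2), 7946.85)
four_gpu_topk4_combine = bandwidth_gb_s(payload_bytes(2048, 7168, 4), 12778.48)
```

Under these assumptions the combine figures are reproduced closely. The same estimate for dispatch comes out slightly below the reported values (about 5.71 GB/s against 5.72 GB/s in the 2-GPU run), so the benchmark may count some additional per-message bytes on that path.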

Current Limitations

  • MACA IBGDA is not active in this path. Current measurements are P2P/IPC
    measurements, not GPU-initiated RDMA measurements.
  • On the tested node, default 4-GPU routing only enables P2P fast path for the
    close MetaXLink pairs. Full 4-GPU P2P testing requires
    MOONCAKE_EP_MACA_ALLOW_NODE_P2P=1.
  • Without full node P2P enabled, 4-GPU all-to-all traffic can fall back to the
    Python fallback path. That fallback is not the performance target of this PR.
  • A Mooncake PG control-plane run printed valid 2-GPU EP bandwidth, but one rank
    segfaulted during process-group teardown. Gloo control-plane runs completed
    cleanly, so this PR treats PG teardown as a separate follow-up issue.
  • The previous MCCL all-to-all helper experiment is intentionally not part of
    this submitted path.

Checklist

  • I have performed a self-review of my own code
  • I have formatted my code using ./scripts/code_format.sh
  • I have run pre-commit run --all-files and all hooks pass
  • I have updated the documentation (if applicable)
  • I have added or run tests to prove my changes are effective
  • For changes >500 LOC: I have filed an RFC issue

AI Assistance Disclosure

  • No AI tools were used
  • AI tools were used

AI assistance was used to inspect the MACA EP changes, summarize test results,
and draft this PR description.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces MACA support and a PyTorch-based fallback implementation of the alltoall operation, including new CUDA kernels for fused packing, compacting, and reducing. The review feedback focuses on critical optimizations and safety improvements in the CUDA kernels, such as utilizing shared memory to reduce redundant global memory reads, adding bounds checks to prevent out-of-bounds accesses, and guarding kernel launches against empty inputs to avoid invalid configuration errors. Additionally, it is recommended to make the PyTorch alltoall fallback implementation stateless to prevent potential race conditions when multiple MoE layers are interleaved.

Comment thread mooncake-ep/src/mooncake_ep_alltoall.cu Outdated
Comment on lines +155 to +174
    int count = load_i32_words(rank_base + local_expert * fused_hidden);

    int begin = 0;
    for (int r = 0; r < src_rank; ++r) {
        const nv_bfloat16* prev_rank_base =
            recv_payload + r * fused_slots_per_rank * fused_hidden;
        begin += load_i32_words(prev_rank_base + local_expert * fused_hidden);
    }

    if (m == 0 && threadIdx.x == 0) {
        layout_range[local_expert * num_ranks + src_rank] =
            (static_cast<int64_t>(begin) << 32) | static_cast<uint32_t>(count);
        atomicAdd(recv_count + local_expert, count);
    }

    if (m >= count) return;

    int src_begin = 0;
    for (int e = 0; e < local_expert; ++e)
        src_begin += load_i32_words(rank_base + e * fused_hidden);

Contributor

high

In compact_dispatch_fused_kernel, the loop to compute begin and src_begin is executed redundantly by all 256 threads in the block. Since these values only depend on blockIdx.x (which is constant for all threads in the block), we can compute them once in thread 0 and share them via __shared__ memory. This significantly reduces redundant global memory reads and instruction overhead.

    __shared__ int shared_count;
    __shared__ int shared_begin;
    __shared__ int shared_src_begin;
    if (threadIdx.x == 0) {
        int count = load_i32_words(rank_base + local_expert * fused_hidden);
        shared_count = count;

        int begin = 0;
        for (int r = 0; r < src_rank; ++r) {
            const nv_bfloat16* prev_rank_base = 
                recv_payload + r * fused_slots_per_rank * fused_hidden;
            begin += load_i32_words(prev_rank_base + local_expert * fused_hidden);
        }
        shared_begin = begin;

        int src_begin = 0;
        for (int e = 0; e < local_expert; ++e) 
            src_begin += load_i32_words(rank_base + e * fused_hidden);
        shared_src_begin = src_begin;
    }
    __syncthreads();
    int count = shared_count;
    int begin = shared_begin;

    if (m == 0 && threadIdx.x == 0) {
        layout_range[local_expert * num_ranks + src_rank] =
            (static_cast<int64_t>(begin) << 32) | static_cast<uint32_t>(count);
        atomicAdd(recv_count + local_expert, count);
    }

    if (m >= count) return;
    int src_begin = shared_src_begin;
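
For readers less familiar with the kernel, the three per-block values can be modelled on the host as prefix sums over a count table. This is a minimal sketch, assuming `counts[r][e]` holds the number of messages rank `r` sent for local expert `e`; the function names are hypothetical and the real kernel reads these counts from `recv_payload` instead.

```python
from typing import List, Tuple


def pack_range(begin: int, count: int) -> int:
    """Pack (begin, count) into one 64-bit value, begin in the high 32 bits."""
    return (begin << 32) | (count & 0xFFFFFFFF)


def unpack_range(packed: int) -> Tuple[int, int]:
    """Inverse of pack_range: returns (begin, count)."""
    return packed >> 32, packed & 0xFFFFFFFF


def block_offsets(
    counts: List[List[int]], src_rank: int, local_expert: int
) -> Tuple[int, int, int]:
    """Return (count, begin, src_begin) for one (src_rank, local_expert) pair."""
    count = counts[src_rank][local_expert]
    # Messages for this expert that arrived from lower-numbered ranks.
    begin = sum(counts[r][local_expert] for r in range(src_rank))
    # Messages from this rank that belong to lower-numbered experts.
    src_begin = sum(counts[src_rank][e] for e in range(local_expert))
    return count, begin, src_begin
```

All three values depend only on the `(src_rank, local_expert)` pair, which is why computing them once per block and sharing the result is sufficient.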

Comment thread mooncake-ep/src/mooncake_ep_alltoall.cu Outdated
Comment on lines +79 to +81
int64_t expert = topk_idx[i];
if (expert < 0) return;

Contributor

medium

Add a bounds check for expert to prevent potential out-of-bounds memory access on counters and expert_offsets if expert >= num_ranks * num_local_experts.

    int64_t expert = topk_idx[i];
    if (expert < 0 || expert >= num_ranks * num_local_experts) return;

Comment thread mooncake-ep/src/mooncake_ep_alltoall.cu Outdated
Comment on lines +283 to +285
    count_dispatch_kernel<<<ceil_div(total, 256), 256, 0, stream>>>(
        topk_idx.data_ptr<int64_t>(), counts_by_expert.data_ptr<int32_t>(),
        num_tokens, num_topk, num_local_experts);

Contributor

medium

If total is 0 (e.g., empty batch or no tokens), launching count_dispatch_kernel with 0 blocks will result in a CUDA invalid configuration error. We should wrap the kernel launch in a check if (total > 0) to handle empty inputs gracefully.

    if (total > 0) {
        count_dispatch_kernel<<<ceil_div(total, 256), 256, 0, stream>>>(
            topk_idx.data_ptr<int64_t>(), counts_by_expert.data_ptr<int32_t>(),
            num_tokens, num_topk, num_local_experts);
    }

Comment thread mooncake-ep/src/mooncake_ep_alltoall.cu Outdated
Comment on lines +298 to +304
    pack_dispatch_fused_kernel<<<total, 256, 0, stream>>>(
        reinterpret_cast<const nv_bfloat16*>(x.data_ptr()),
        topk_idx.data_ptr<int64_t>(),
        expert_offsets.data_ptr<int32_t>(), counters.data_ptr<int32_t>(),
        reinterpret_cast<nv_bfloat16*>(send_payload.data_ptr()),
        send_route.data_ptr<int64_t>(), num_tokens, hidden, num_topk,
        num_ranks, num_local_experts, max_messages_per_rank);

Contributor

medium

If total is 0 (e.g., empty batch or no tokens), launching pack_dispatch_fused_kernel with 0 blocks will result in a CUDA invalid configuration error. We should wrap the kernel launch in a check if (total > 0) to handle empty inputs gracefully.

    if (total > 0) {
        pack_dispatch_fused_kernel<<<total, 256, 0, stream>>>(
            reinterpret_cast<const nv_bfloat16*>(x.data_ptr()),
            topk_idx.data_ptr<int64_t>(),
            expert_offsets.data_ptr<int32_t>(), counters.data_ptr<int32_t>(),
            reinterpret_cast<nv_bfloat16*>(send_payload.data_ptr()),
            send_route.data_ptr<int64_t>(), num_tokens, hidden, num_topk,
            num_ranks, num_local_experts, max_messages_per_rank);
    }

Comment thread mooncake-ep/src/mooncake_ep_alltoall.cu Outdated
Comment on lines +351 to +359
    compact_dispatch_fused_kernel<<<num_ranks * num_local_experts *
                                        max_messages_per_rank,
                                    256, 0, stream>>>(
        reinterpret_cast<const nv_bfloat16*>(recv_payload.data_ptr()),
        layout_range.data_ptr<int64_t>(), recv_count.data_ptr<int32_t>(),
        return_src_pos.data_ptr<int64_t>(),
        reinterpret_cast<nv_bfloat16*>(packed_recv_x.data_ptr()),
        packed_recv_src_info.data_ptr<int32_t>(), hidden, num_ranks,
        num_local_experts, max_messages_per_rank, num_recv_slots);

Contributor

medium

If max_messages_per_rank is 0, launching compact_dispatch_fused_kernel with 0 blocks will result in a CUDA invalid configuration error. We should wrap the kernel launch in a check if (max_messages_per_rank > 0) to handle empty inputs gracefully.

    if (max_messages_per_rank > 0) {
        compact_dispatch_fused_kernel<<<num_ranks * num_local_experts *
                                            max_messages_per_rank,
                                        256, 0, stream>>>(
            reinterpret_cast<const nv_bfloat16*>(recv_payload.data_ptr()),
            layout_range.data_ptr<int64_t>(), recv_count.data_ptr<int32_t>(),
            return_src_pos.data_ptr<int64_t>(),
            reinterpret_cast<nv_bfloat16*>(packed_recv_x.data_ptr()),
            packed_recv_src_info.data_ptr<int32_t>(), hidden, num_ranks,
            num_local_experts, max_messages_per_rank, num_recv_slots);
    }

Comment on lines +669 to +675
        self._torch_alltoall_state = {
            "send_route": send_route,
            "return_src_pos": return_src_pos,
            "num_tokens": num_tokens,
            "num_topk": num_topk,
            "max_messages_per_rank": max_messages_per_rank,
        }

Contributor

medium

The current implementation of _torch_alltoall_routed_dispatch and _torch_alltoall_routed_combine uses a stateful dictionary self._torch_alltoall_state to store send_route and return_src_pos. This can lead to subtle bugs or race conditions if multiple MoE layers are executed concurrently or interleaved. We can make the implementation completely stateless by wrapping send_route and return_src_pos inside the src_info tuple returned by dispatch and passed to combine via handle.
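
The stateless pattern described above might look like the following. This is a simplified sketch using plain Python lists rather than tensors or collectives; `RouteHandle` and `StatelessRouter` are hypothetical names, not part of the Mooncake EP API, and `send_route[slot]` is assumed to hold the source token index for each packed slot.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class RouteHandle:
    """Hypothetical per-call routing state, returned by dispatch."""
    send_route: Tuple[int, ...]  # send_route[slot] = source token index
    num_tokens: int


class StatelessRouter:
    """Keeps no routing state on self, so calls can be interleaved safely."""

    def dispatch(
        self, tokens: Sequence[float], route: Sequence[int]
    ) -> Tuple[List[float], RouteHandle]:
        # Gather tokens into packed slots according to the route.
        packed = [tokens[src] for src in route]
        return packed, RouteHandle(tuple(route), len(tokens))

    def combine(self, packed: Sequence[float], handle: RouteHandle) -> List[float]:
        # Scatter-add each packed slot back to its source token.
        out = [0.0] * handle.num_tokens
        for slot, src in enumerate(handle.send_route):
            out[src] += packed[slot]
        return out


router = StatelessRouter()
# Two "layers" dispatch before either one combines.
packed_a, handle_a = router.dispatch([1.0, 2.0, 3.0], [2, 0, 1])
packed_b, handle_b = router.dispatch([10.0, 20.0], [1, 1, 0])
out_a = router.combine(packed_a, handle_a)
out_b = router.combine(packed_b, handle_b)
```

Because each `combine` receives the handle produced by its own `dispatch`, the second dispatch cannot overwrite routing data that the first combine still needs, which is the failure mode a shared `self._torch_alltoall_state` would allow.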

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.

self.num_ep_buffer_bytes = num_ep_buffer_bytes
self.backend = self.group
# NIC auto-detection happens inside ep.Buffer via Topology::discover().
_debug_init(self.rank, "before ep.Buffer")

Collaborator

Do we really need such debug messages?

Collaborator Author

Removed. These messages were only used while diagnosing MACA initialization hangs and are not needed in the submitted path.

Comment on lines +284 to +288
        if self.group_size == 1:
            _debug_init(self.rank, "single-rank skip ipc handle export")
            self._use_fallback = False
            _debug_init(self.rank, "connect done fallback=False")
            return

Collaborator

Why do we need to handle this special case?

Collaborator Author

This is intentional. With a single rank there is no peer that can import an IPC handle, so exporting/gathering IPC handles is unnecessary. On MACA it also avoids an unnecessary driver IPC call during initialization. I added a comment to make this explicit.

        self.connect()

    def _maca_phase_fence(self, send_event: Optional[Any] = None) -> None:
        if not _USE_MACA or _MACA_PHASE_FENCE in {"", "0", "off", "none"}:

Collaborator

In what case do we need the maca phase fence flag? I'm afraid introducing new env vars may cause understanding burdens to users, so if you could explain that in advance, it would be very helpful.

Collaborator Author

This was originally a diagnostic switch used to compare different MACA phase synchronization strategies while debugging the split SEND/RECV path. I agree it should not be exposed as a user-facing env var, so I removed the flag and diagnostic modes. The remaining fence is an internal MACA compatibility fence only.

        if not _USE_MACA or _MACA_PHASE_FENCE in {"", "0", "off", "none"}:
            return

        backend = dist.get_backend(self.group)

Collaborator

When do we use gloo instead of mooncake-cpu?

Collaborator Author

We only handle Gloo because some smoke tests pass a Gloo process group, so the tiny fence token uses CPU there while EP payload still stays on the P2P fast path.

Collaborator

Which smoke tests? I may be missing some context.

Collaborator Author

This refers to the local MACA validation script scripts/metax/smoke_ep_p2p.py, where Gloo is only used to isolate and validate the EP P2P payload path, not to replace mooncake-cpu in production.

Comment thread mooncake-ep/src/CMakeLists.txt Outdated
@@ -1,4 +1,7 @@
-add_library(mooncake_ep ep_py.cpp mooncake_ep_buffer.cpp mooncake_ep_kernel.cu)
+add_library(mooncake_ep

Collaborator

Better leave this file unchanged lol

Collaborator Author

No problem

@UNIDY2002 UNIDY2002 left a comment

Collaborator

Would you mind updating the PR title to fit the actual impl?

Also, I have a small comment.

        self, device: torch.device, dtype: torch.dtype = torch.int32
    ) -> torch.Tensor:
        if not self._is_mooncake_backend():
            return torch.ones((self.group_size,), dtype=dtype, device=device)

Collaborator

I think we should raise a warning in this case, as running without mooncake-pg may lose the active_ranks features.

Collaborator Author

no problem
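
One way the warning requested in this thread could be raised is sketched below. It is illustrative only: `fallback_active_ranks` is a hypothetical helper, it covers only the non-Mooncake branch, and it returns a plain list where the real method returns `torch.ones(...)`.

```python
import warnings
from typing import List


def fallback_active_ranks(group_size: int) -> List[int]:
    """Report every rank as active when the backend cannot track them.

    Stand-in for the non-Mooncake branch only; a list of ones replaces
    the tensor so the sketch runs without torch.
    """
    warnings.warn(
        "Process group is not a Mooncake backend: active_ranks tracking is "
        "unavailable, so every rank is reported as active.",
        RuntimeWarning,
        stacklevel=2,
    )
    return [1] * group_size
```

Using `warnings.warn` rather than a log call lets Python's default filter show the message once per call site, which keeps repeated per-step calls from flooding the output.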

Development

Successfully merging this pull request may close these issues.

  • [RFC]: Mooncake EP Multi-Vendor GPU Adaptation Design
  • [Feature Request]: MACA Support for Mooncake EP/PG

3 participants